Introduction

This paper deals with association rule analysis. A common example is customer purchase analysis, which tries to predict whether a consumer will buy product(s) Y if they buy product(s) X.

The data for the previous research was taken from the Kaggle platform: https://www.kaggle.com/gorkhachatryan01/purchase-behaviour

The original paper comes from my ML project; here I would like to improve it through reproducible research.

To check reproducibility, the association rule analysis is repeated on a different dataset: https://www.kaggle.com/roshansharma/market-basket-optimization/version/1

library(kableExtra)
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
transactions = read.transactions(
  "Market_Basket_Optimisation.csv",
  format = "basket",
  sep = ",",
  skip = 0,
  header = TRUE
)
transactions
## transactions in sparse format with
##  7500 transactions (rows) and
##  119 items (columns)
itemFrequencyPlot(
  transactions,
  topN = 20,
  type = "absolute",
  main = "Item frequency",
  cex.names = 0.85
)

The figure above shows the twenty most popular purchases. Mineral water comes first, followed by eggs, spaghetti, french fries and chocolate.
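
As a quick numeric check of this ranking, the absolute item frequencies can also be sorted directly. This is a minimal sketch, assuming the transactions object loaded above:

# top items by absolute frequency; itemFrequency() comes from the arules package
head(sort(itemFrequency(transactions, type = "absolute"), decreasing = TRUE), 5)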

Association rules

Global rules calculations

We start the analysis by creating rules; to do this I will use the Apriori algorithm. Because the algorithm did not find enough rules with the default values of support and confidence, I lowered them to 0.01 (support) and 0.4 (confidence). After the calculation, the algorithm found 17 rules.

rules = apriori(transactions, parameter = list(supp = 0.01, conf = 0.40))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 75 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Association rule analysis is a technique to uncover how items are associated with each other.

There are three common ways to measure association: support, confidence and lift.

Some examples are available in my Git repository: https://github.com/wzs19961101/final-projects-.git

Support

Support is a measure of how often a certain subset of products appears in the whole set of transactions. In other words, it is the probability that a transaction contains all of the items together. Below are the top six rules in terms of support.

rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
##     lhs                            rhs             support    confidence
## [1] {ground beef}               => {mineral water} 0.04093333 0.4165536 
## [2] {olive oil}                 => {mineral water} 0.02746667 0.4178499 
## [3] {soup}                      => {mineral water} 0.02306667 0.4564644 
## [4] {ground beef,spaghetti}     => {mineral water} 0.01706667 0.4353741 
## [5] {ground beef,mineral water} => {spaghetti}     0.01706667 0.4169381 
## [6] {chocolate,spaghetti}       => {mineral water} 0.01586667 0.4047619 
##     coverage   lift     count
## [1] 0.09826667 1.748266 307  
## [2] 0.06573333 1.753707 206  
## [3] 0.05053333 1.915771 173  
## [4] 0.03920000 1.827256 128  
## [5] 0.04093333 2.394361 128  
## [6] 0.03920000 1.698777 119
rules_supp_table %>%
  kable() %>%
  kable_styling()
lhs rhs support confidence coverage lift count
[1] {ground beef} => {mineral water} 0.0409333 0.4165536 0.0982667 1.748266 307
[2] {olive oil} => {mineral water} 0.0274667 0.4178499 0.0657333 1.753707 206
[3] {soup} => {mineral water} 0.0230667 0.4564644 0.0505333 1.915771 173
[4] {ground beef,spaghetti} => {mineral water} 0.0170667 0.4353741 0.0392000 1.827256 128
[5] {ground beef,mineral water} => {spaghetti} 0.0170667 0.4169381 0.0409333 2.394361 128
[6] {chocolate,spaghetti} => {mineral water} 0.0158667 0.4047619 0.0392000 1.698777 119

The result with the highest support value (around 4%) means that 307 transactions out of a total of 7,500 contained both ground beef and mineral water. The second rule means that olive oil and mineral water appeared together in about 2.7% of transactions, and the third that soup and mineral water appeared together in about 2.3% of transactions.
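
To make the support definition concrete, the top rule can be recomputed by hand. This is a minimal sketch, assuming the transactions object loaded above; %ain% is the arules operator that matches transactions containing all of the listed items.

# support(X) = (transactions containing every item in X) / (total number of transactions)
n_total <- length(transactions)
n_rule  <- sum(transactions %ain% c("ground beef", "mineral water"))
n_rule / n_total  # should be close to the reported support of about 0.041 (307 / 7500)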

Confidence

Confidence is a measure of how likely it is that the consumer buys product Y (rhs) if they have product(s) X (lhs) in their basket. More formally, it is the estimated conditional probability of seeing the Y product(s) in a transaction given that the transaction also contains the X product(s).

rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
##     lhs                         rhs             support    confidence
## [1] {eggs,ground beef}       => {mineral water} 0.01013333 0.5066667 
## [2] {ground beef,milk}       => {mineral water} 0.01106667 0.5030303 
## [3] {chocolate,ground beef}  => {mineral water} 0.01093333 0.4739884 
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266 
## [5] {soup}                   => {mineral water} 0.02306667 0.4564644 
## [6] {pancakes,spaghetti}     => {mineral water} 0.01146667 0.4550265 
##     coverage   lift     count
## [1] 0.02000000 2.126469  76  
## [2] 0.02200000 2.111207  83  
## [3] 0.02306667 1.989319  82  
## [4] 0.02360000 1.968075  83  
## [5] 0.05053333 1.915771 173  
## [6] 0.02520000 1.909736  86
rules_conf_table %>%
  kable() %>%
  kable_styling()
lhs rhs support confidence coverage lift count
[1] {eggs,ground beef} => {mineral water} 0.0101333 0.5066667 0.0200000 2.126469 76
[2] {ground beef,milk} => {mineral water} 0.0110667 0.5030303 0.0220000 2.111207 83
[3] {chocolate,ground beef} => {mineral water} 0.0109333 0.4739884 0.0230667 1.989319 82
[4] {frozen vegetables,milk} => {mineral water} 0.0110667 0.4689266 0.0236000 1.968074 83
[5] {soup} => {mineral water} 0.0230667 0.4564644 0.0505333 1.915771 173
[6] {pancakes,spaghetti} => {mineral water} 0.0114667 0.4550265 0.0252000 1.909736 86

Confidence values for the six rules presented are quite similar (from about 45% to 51%). Let’s analyze only the rule with the highest confidence.

The confidence value says that a consumer who buys eggs and ground beef will, with a probability of about 51%, also buy mineral water.
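
That value can be recomputed directly from the definition. A minimal sketch, again assuming the transactions object is still in memory:

# confidence(X => Y) = support(X and Y) / support(X)
n_lhs  <- sum(transactions %ain% c("eggs", "ground beef"))
n_rule <- sum(transactions %ain% c("eggs", "ground beef", "mineral water"))
n_rule / n_lhs  # should be close to the reported confidence of about 0.507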

Lift

Lift can be understood as a kind of correlation measure. Put simply, it tells us whether products X and Y tend to be bought together or separately.

A value greater than one says that the products appear together more often than would be expected if they were independent; a value less than one says that they appear together less often than expected.

rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
##     lhs                            rhs             support    confidence
## [1] {ground beef,mineral water} => {spaghetti}     0.01706667 0.4169381 
## [2] {eggs,ground beef}          => {mineral water} 0.01013333 0.5066667 
## [3] {ground beef,milk}          => {mineral water} 0.01106667 0.5030303 
## [4] {chocolate,ground beef}     => {mineral water} 0.01093333 0.4739884 
## [5] {frozen vegetables,milk}    => {mineral water} 0.01106667 0.4689266 
## [6] {soup}                      => {mineral water} 0.02306667 0.4564644 
##     coverage   lift     count
## [1] 0.04093333 2.394361 128  
## [2] 0.02000000 2.126469  76  
## [3] 0.02200000 2.111207  83  
## [4] 0.02306667 1.989319  82  
## [5] 0.02360000 1.968075  83  
## [6] 0.05053333 1.915771 173
rules_lift_table %>%
  kable() %>%
  kable_styling()
lhs rhs support confidence coverage lift count
[1] {ground beef,mineral water} => {spaghetti} 0.0170667 0.4169381 0.0409333 2.394361 128
[2] {eggs,ground beef} => {mineral water} 0.0101333 0.5066667 0.0200000 2.126469 76
[3] {ground beef,milk} => {mineral water} 0.0110667 0.5030303 0.0220000 2.111207 83
[4] {chocolate,ground beef} => {mineral water} 0.0109333 0.4739884 0.0230667 1.989319 82
[5] {frozen vegetables,milk} => {mineral water} 0.0110667 0.4689266 0.0236000 1.968074 83
[6] {soup} => {mineral water} 0.0230667 0.4564644 0.0505333 1.915771 173

Analyzing the top six rules, we can see that all of their lift values are higher than one. So we can conclude that the rhs products are more likely to be bought together with the lhs products than if they were independent. For the {ground beef, mineral water} => {spaghetti} rule, the items have been seen together in transactions at about 2.39 times the rate expected if they were independent.
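
The lift of that rule can likewise be recomputed from the definition. A minimal sketch under the same assumptions:

# lift(X => Y) = support(X and Y) / (support(X) * support(Y))
n <- length(transactions)
supp_lhs  <- sum(transactions %ain% c("ground beef", "mineral water")) / n
supp_rhs  <- unname(itemFrequency(transactions)["spaghetti"])  # support of the rhs alone
supp_rule <- sum(transactions %ain% c("ground beef", "mineral water", "spaghetti")) / n
supp_rule / (supp_lhs * supp_rhs)  # should be close to the reported lift of about 2.39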

plot(rules, engine="plotly")

Let’s also look at the graph showing the location of the rules relative to support (horizontal axis), confidence (vertical axis) and lift (color saturation). Most of the values are arranged in a hyperbolic shape, suggesting that as confidence increases, support decreases. This is mainly due to similarities in the way the measures are calculated, but thanks to this, some outliers are easier to see (such as {soup} => {mineral water}).
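
For reference, the same scatter plot can be requested with the axis and shading measures spelled out explicitly. This is only a sketch of the call; measure and shading are standard arguments of the plot method in arulesViz:

# x-axis = support, y-axis = confidence, colour = lift
plot(rules, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift",
     engine = "plotly")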

Chocolate rules calculation

Let’s say we want to look only at chocolate as our rhs; in simple terms, we want to find out which products are usually bought before or together with chocolate, i.e. with which products the consumer most often buys chocolate.

rules_chocolate = apriori(
    data = transactions,
    parameter = list(supp = 0.001, conf = 0.7),
    appearance = list(default = "lhs", rhs = "chocolate"),
    control = list(verbose = F)
  )
rules_chocolate_table = inspect(rules_chocolate, linebreak = FALSE)
##     lhs                                                  rhs        
## [1] {red wine,tomato sauce}                           => {chocolate}
## [2] {almonds,olive oil,spaghetti}                     => {chocolate}
## [3] {almonds,milk,spaghetti}                          => {chocolate}
## [4] {escalope,french fries,shrimp}                    => {chocolate}
## [5] {burgers,olive oil,pancakes}                      => {chocolate}
## [6] {frozen vegetables,mineral water,pancakes,shrimp} => {chocolate}
##     support     confidence coverage    lift     count
## [1] 0.001066667 0.8000000  0.001333333 4.882018 8    
## [2] 0.001066667 0.7272727  0.001466667 4.438198 8    
## [3] 0.001066667 0.7272727  0.001466667 4.438198 8    
## [4] 0.001066667 0.8888889  0.001200000 5.424464 8    
## [5] 0.001200000 0.7500000  0.001600000 4.576892 9    
## [6] 0.001066667 0.7272727  0.001466667 4.438198 8
rules_chocolate_table %>%
  kable() %>%
  kable_styling()
lhs rhs support confidence coverage lift count
[1] {red wine,tomato sauce} => {chocolate} 0.0010667 0.8000000 0.0013333 4.882018 8
[2] {almonds,olive oil,spaghetti} => {chocolate} 0.0010667 0.7272727 0.0014667 4.438198 8
[3] {almonds,milk,spaghetti} => {chocolate} 0.0010667 0.7272727 0.0014667 4.438198 8
[4] {escalope,french fries,shrimp} => {chocolate} 0.0010667 0.8888889 0.0012000 5.424464 8
[5] {burgers,olive oil,pancakes} => {chocolate} 0.0012000 0.7500000 0.0016000 4.576892 9
[6] {frozen vegetables,mineral water,pancakes,shrimp} => {chocolate} 0.0010667 0.7272727 0.0014667 4.438198 8
plot(rules_chocolate, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Conclusions

From this project we can see that association rules are an extremely interesting method of data analysis which can relatively easily uncover many interesting relationships. I also carried out reproducible research by applying the same methods to another dataset, which demonstrates the reproducibility of my code.

The original research paper is shown below:

library(arules)
library(arulesViz)
setwd("C:/Users/wangz/Desktop")
md = read.transactions("dataset.csv",format = "basket",
                                sep = ",",skip = 0, header = TRUE)
dim(md)
## [1] 1498   38
#average number of items 
ave_size = mean(size(md));
ave_size 
## [1] 10.34913
summary(md)
## transactions as itemMatrix in sparse format with
##  1498 rows (elements/itemsets/transactions) and
##  38 columns (items) and a density of 0.2723456 
## 
## most frequent items:
## vegetables    poultry    waffles     bagels lunch meat    (Other) 
##        894        431        418        417        413      12930 
## 
## element (itemset/transaction) length distribution:
## sizes
##   3   4   5   6   7   8   9  10  11  12  13  14 
##   8  57  51  51  71  74  95 191 304 320 212  64 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   11.00   10.35   12.00   14.00 
## 
## includes extended item information - examples:
##          labels
## 1  all- purpose
## 2 aluminum foil
## 3        bagels

Check which products appear most and least often, and visualize the item frequencies.

# relative frequency
round(itemFrequency(md, type="relative"),4)
##                 all- purpose                aluminum foil 
##                       0.2630                       0.2637 
##                       bagels                         beef 
##                       0.2784                       0.2623 
##                       butter                      cereals 
##                       0.2610                       0.2737 
##                      cheeses                   coffee/tea 
##                       0.2603                       0.2630 
##                 dinner rolls dishwashing liquid/detergent 
##                       0.2583                       0.2684 
##                         eggs                        flour 
##                       0.2690                       0.2570 
##                       fruits                    hand soap 
##                       0.2637                       0.2377 
##                    ice cream             individual meals 
##                       0.2750                       0.2717 
##                        juice                      ketchup 
##                       0.2577                       0.2503 
##            laundry detergent                   lunch meat 
##                       0.2644                       0.2757 
##                         milk                        mixes 
##                       0.2710                       0.2737 
##                 paper towels                        pasta 
##                       0.2550                       0.2717 
##                         pork                      poultry 
##                       0.2497                       0.2877 
##                sandwich bags              sandwich loaves 
##                       0.2497                       0.2490 
##                      shampoo                         soap 
##                       0.2477                       0.2657 
##                         soda              spaghetti sauce 
##                       0.2737                       0.2543 
##                        sugar                 toilet paper 
##                       0.2670                       0.2704 
##                    tortillas                   vegetables 
##                       0.2443                       0.5968 
##                      waffles                       yogurt 
##                       0.2790                       0.2684
# plot for relative frequency
itemFrequencyPlot(
  md,
  topN = 10,
  type = "relative",
  main = "Item frequency",
  cex.names = 0.85
)

#absolute frequency
itemFrequency(md, type="absolute")
##                 all- purpose                aluminum foil 
##                          394                          395 
##                       bagels                         beef 
##                          417                          393 
##                       butter                      cereals 
##                          391                          410 
##                      cheeses                   coffee/tea 
##                          390                          394 
##                 dinner rolls dishwashing liquid/detergent 
##                          387                          402 
##                         eggs                        flour 
##                          403                          385 
##                       fruits                    hand soap 
##                          395                          356 
##                    ice cream             individual meals 
##                          412                          407 
##                        juice                      ketchup 
##                          386                          375 
##            laundry detergent                   lunch meat 
##                          396                          413 
##                         milk                        mixes 
##                          406                          410 
##                 paper towels                        pasta 
##                          382                          407 
##                         pork                      poultry 
##                          374                          431 
##                sandwich bags              sandwich loaves 
##                          374                          373 
##                      shampoo                         soap 
##                          371                          398 
##                         soda              spaghetti sauce 
##                          410                          381 
##                        sugar                 toilet paper 
##                          400                          405 
##                    tortillas                   vegetables 
##                          366                          894 
##                      waffles                       yogurt 
##                          418                          402
#plot for absolute frequency
itemFrequencyPlot(
  md,
  topN = 10,
  type = "absolute",
  main = "Item frequency",
  cex.names = 0.85
)

The figure above shows the ten most popular purchases. Vegetables come first, followed by poultry and waffles.

#Plot for min support
itemFrequencyPlot(md, support = 0.1) #minimum support at 10%

Association rules

Association rule analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association: support, confidence and lift.

Global rules calculations

I use the Apriori algorithm. To simplify the analysis, I used the values confidence = 0.4 and support = 0.1. After the calculation, the algorithm found 38 rules.

rules = apriori(md, parameter = list(supp = 0.1, conf = 0.4))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 149 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1498 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [38 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Support

Support is a measure of how often a certain subset of items appears in the whole dataset.

rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
##     lhs                    rhs          support   confidence coverage  lift    
## [1] {}                  => {vegetables} 0.5967957 0.5967957  1.0000000 1.000000
## [2] {yogurt}            => {vegetables} 0.1762350 0.6567164  0.2683578 1.100404
## [3] {poultry}           => {vegetables} 0.1748999 0.6078886  0.2877170 1.018587
## [4] {laundry detergent} => {vegetables} 0.1728972 0.6540404  0.2643525 1.095920
## [5] {lunch meat}        => {vegetables} 0.1715621 0.6222760  0.2757009 1.042695
## [6] {cereals}           => {vegetables} 0.1702270 0.6219512  0.2736983 1.042151
##     count
## [1] 894  
## [2] 264  
## [3] 262  
## [4] 259  
## [5] 257  
## [6] 255

Confidence

Confidence is a measure of how likely it is that the consumer buys product Y (rhs) if they have product(s) X (lhs) in their basket. More formally, it is the estimated conditional probability of seeing the Y product(s) in a transaction given that the transaction also contains the X product(s).

rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
##     lhs                    rhs          support   confidence coverage  lift    
## [1] {yogurt}            => {vegetables} 0.1762350 0.6567164  0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404  0.2643525 1.095920
## [3] {eggs}              => {vegetables} 0.1695594 0.6302730  0.2690254 1.056095
## [4] {lunch meat}        => {vegetables} 0.1715621 0.6222760  0.2757009 1.042695
## [5] {cereals}           => {vegetables} 0.1702270 0.6219512  0.2736983 1.042151
## [6] {flour}             => {vegetables} 0.1595461 0.6207792  0.2570093 1.040187
##     count
## [1] 264  
## [2] 259  
## [3] 254  
## [4] 257  
## [5] 255  
## [6] 239

Lift

Lift can be understood as a kind of correlation measure. Put simply, it tells us whether products X and Y tend to be bought together or separately. A value greater than one says that the products appear together more often than would be expected if they were independent; a value less than one says that they appear together less often than expected.

rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
##     lhs                    rhs          support   confidence coverage  lift    
## [1] {yogurt}            => {vegetables} 0.1762350 0.6567164  0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404  0.2643525 1.095920
## [3] {eggs}              => {vegetables} 0.1695594 0.6302730  0.2690254 1.056095
## [4] {lunch meat}        => {vegetables} 0.1715621 0.6222760  0.2757009 1.042695
## [5] {cereals}           => {vegetables} 0.1702270 0.6219512  0.2736983 1.042151
## [6] {flour}             => {vegetables} 0.1595461 0.6207792  0.2570093 1.040187
##     count
## [1] 264  
## [2] 259  
## [3] 254  
## [4] 257  
## [5] 255  
## [6] 239

Looking at the results, we can see that all of the lift values are higher than 1. So we can say that the rhs product is more likely to be bought together with the lhs products than if they were independent.
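
One way to see where these modest lift values come from: lift is confidence divided by the support of the rhs. A minimal sketch, assuming the md transactions object loaded above (%ain% again matches transactions containing all of the listed items):

# lift(X => Y) = confidence(X => Y) / support(Y)
supp_veg <- unname(itemFrequency(md)["vegetables"])  # about 0.597
conf_yog <- sum(md %ain% c("yogurt", "vegetables")) / sum(md %ain% "yogurt")
conf_yog / supp_veg  # should be close to the reported lift of about 1.10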

plot(rules, engine="plotly")

Change rhs to another product: ice cream rules calculation

In our data, vegetables is by far the most frequent product, so the global rules do not reveal anything interesting beyond rules pointing to vegetables. So let’s use another product as our rhs: I will take ice cream.

rules_ice_cream = apriori(
    data = md,
    parameter = list(supp = 0.01, conf = 0.4),
    appearance = list(default = "lhs", rhs = "ice cream"),
    control = list(verbose = F)
  )
rules_ice_cream_table = inspect(rules_ice_cream, linebreak = FALSE)
##      lhs                                                  rhs        
## [1]  {hand soap,spaghetti sauce,vegetables}            => {ice cream}
## [2]  {cereals,paper towels,sandwich loaves}            => {ice cream}
## [3]  {all- purpose,lunch meat,spaghetti sauce}         => {ice cream}
## [4]  {aluminum foil,pasta,spaghetti sauce}             => {ice cream}
## [5]  {dishwashing liquid/detergent,flour,paper towels} => {ice cream}
## [6]  {aluminum foil,paper towels,soda}                 => {ice cream}
## [7]  {aluminum foil,coffee/tea,soda}                   => {ice cream}
## [8]  {aluminum foil,juice,milk}                        => {ice cream}
## [9]  {aluminum foil,beef,yogurt}                       => {ice cream}
## [10] {aluminum foil,beef,vegetables}                   => {ice cream}
## [11] {aluminum foil,milk,toilet paper}                 => {ice cream}
##      support    confidence coverage   lift     count
## [1]  0.01001335 0.4054054  0.02469960 1.474023 15   
## [2]  0.01001335 0.4838710  0.02069426 1.759317 15   
## [3]  0.01001335 0.4054054  0.02469960 1.474023 15   
## [4]  0.01001335 0.5000000  0.02002670 1.817961 15   
## [5]  0.01001335 0.5000000  0.02002670 1.817961 15   
## [6]  0.01001335 0.4838710  0.02069426 1.759317 15   
## [7]  0.01134846 0.4594595  0.02469960 1.670559 17   
## [8]  0.01001335 0.5000000  0.02002670 1.817961 15   
## [9]  0.01001335 0.4545455  0.02202937 1.652692 15   
## [10] 0.01802403 0.4576271  0.03938585 1.663897 27   
## [11] 0.01134846 0.4358974  0.02603471 1.584889 17

Because there are fewer transactions of this type, I reduced the initial support value to 0.01.

Due to the small sample, there is no clear pattern in the results of the analysis.

plot(rules_ice_cream, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules_ice_cream, method="graph") 

In this paper, I mainly used the Apriori method for association rules. Although the results are not very strong, I think that association rules are an interesting method of data analysis.